This study proposes a five-star rating system built on a Cox proportional-hazards (Cox PH) model to classify truck drivers' risk and help companies in the industry make informed decisions. More than three million rows of driver data are used to train the survival model. Building on a model previously developed by the company, this study attempts to improve prediction precision and to generalize the results to different fleets using more significant features. The following sections discuss the background, the data, and the model in greater detail.
There are 700,000 trucks across Canada, 420,000 of which carry freight commercially. As a result, large amounts of data are produced and stored every day. To make the most of these data, fleet companies now employ many advanced techniques to boost their operations. Fleet managers are tasked with a wide range of responsibilities such as GPS-based vehicle tracking, mechanical diagnostics of log events, and, most recently, driver behavior analysis using machine learning and even deep learning techniques. With the growing maturity of big data algorithms and applications, the operational efficiency of the transportation industry will improve significantly.
Fleetmetrica is a data analytics company in the transportation industry helping fleets leverage their telematics data for bottom-line results. Its solutions simplify the process of managing critical and time-consuming activities relating to key business issues such as safety, compliance, fuel, utilization, customer service, maintenance, and driver satisfaction and retention.
Fleetmetrica’s platform uses predictive analytics to automatically determine when and where to act in the fleet, and then makes this information available to the people that need it and in the format they want to receive it.
Previously, Fleetmetrica developed a survival model to evaluate each driver's accident risk. The key initial findings, from monitoring driving performance with a telematics system and providing drivers with feedback, coaching, and incentives for improved driving, are as follows:
The resulting model shows only limited evidence of elevated risk for overspeeding, and only when driving on a highway. Moreover, the risk of an accident drops as the driver spends a higher proportion of time overspeeding, which is counterintuitive.
The risk of accidents is elevated over the baseline by an amount equal to:
where p is the average of the proportion of highway speeding pings to total pings. The quantity peaks around p = 0.05 and decreases as p gets larger.
Since Fleetmetrica’s initial model has limited explanatory power and was trained and tested on only one company’s (Company A’s) data, this research will explore:
Data from three companies (A, B, and C) are relevant to this study. We use Company A’s data for model training and validation; we also attempt to test on Company B’s and Company C’s data to check the model’s applicability.
Company A’s data has 5 files in csv format:
Company B’s data has 4 files in csv format:
Company C’s data has 3 files in csv format:
To generalize the model, we need features common to all three companies; to improve accuracy, however, we should also explore company-specific features. Therefore, in the data-cleaning step we retain both the features shared by the three companies and the features specific to Company A. Since all files share the primary key driver_id, we join the data sets (overspeed, speeding_incidents, hard_brake) to trips and run models on the joined data. Data are not consistently available across the three companies: features like hard_brakes do not exist for Company B, and overspeeding features are not available for Company C. It is also worth noting that accidents are recorded with dates only, without exact times, while the corresponding trips data is time-stamped; this can lead to inaccurate joins in some cases. We tried treating the last trip on an accident date as the accident trip when multiple trips occurred that day, but ultimately discarded this logic because it changed the distribution of the whole data set (the fitted model curve changed entirely), indicating a poor approximation. We therefore use the original joining method, elaborated later. The following plot shows the geographical distribution of accidents on a map:
We did not adopt the approximation of treating the final trip as the accident trip when multiple trips occur on the accident date, because it misclassifies the last leg of a multi-day trip as non-accident and thus substantially changes the set of true accident trips, even though it raises the accident count to more than 200.
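The date-level join described above can be sketched as follows. The report's own pipeline is in R; this is a minimal Python sketch, and the column names trip_start and accident_date are illustrative rather than the companies' actual schemas (only driver_id comes from the source).

```python
import pandas as pd

# Toy trips (full timestamps) and accidents (dates only); column names
# other than driver_id are illustrative, not the actual schemas.
trips = pd.DataFrame({
    "driver_id": [1, 1, 2],
    "trip_start": pd.to_datetime(
        ["2019-03-01 08:00", "2019-03-01 17:30", "2019-03-02 09:15"]),
})
accidents = pd.DataFrame({
    "driver_id": [1],
    "accident_date": pd.to_datetime(["2019-03-01"]),
})

# Accidents carry no time of day, so the join key is (driver_id, date):
# every trip on the accident date gets flagged, which is exactly the
# ambiguity discussed above.
trips["trip_date"] = trips["trip_start"].dt.normalize()
joined = trips.merge(
    accidents,
    left_on=["driver_id", "trip_date"],
    right_on=["driver_id", "accident_date"],
    how="left",
)
joined["event"] = joined["accident_date"].notna().astype(int)
print(joined["event"].tolist())  # -> [1, 1, 0]: both of driver 1's trips flagged
```

Note how the date-only key flags every same-day trip of driver 1, illustrating why the joining choice matters.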
For Company A, we join the trips data with the accidents, overspeeding, hard-brake, and incidents data. Notably, the number of accidents drops sharply from 1,381 to 128, mainly because of missing corresponding trips data. This gap has two sources: 1) some accidents predate the trips data; 2) trips are missing for many time periods. With these characteristics in mind, we model this data set of 3,007,820 rows with 128 accidents.
For Company B, trips are joined with accidents, overspeeding, and incidents; no hard-brake data is available. Accidents are again sparse relative to the total number of trip rows, just as for Company A. After cleaning, Company B has 97 accidents and 1,559,401 rows.
For Company C, only 12 accident events remain after joining (46 before joining); more importantly, critical variables such as long-idle do not exist, and overspeed and incidents data are not provided either. We therefore do not test the model on Company C because of insufficient data, and no further discussion centers on it. The following figure shows how the trip datetimes of Companies A and B overlap with their accident dates.
| | Raw: # of features | Raw: trip obs. | Raw: accident obs. | Cleaned: # of features | Cleaned: trip obs. | Cleaned: accident obs. |
|---|---|---|---|---|---|---|
| Company A | 20 | 4,304,949 | 1,381 | 21 | 3,007,820 | 128 |
| Company B | 33 | 2,157,931 | 1,171 | 21 | 1,559,401 | 97 |
To understand the response variable and check the integrity of the data, it is helpful to examine the average occurrence of accidents for Companies A and B by plotting the frequency per one million trips. Company A’s accident data looks reasonable, whereas Company B’s is incomplete: no accidents are recorded for months 2 through 6. The monthly accident frequency is shown below:
Using the monthly-average method, we compute monthly means of the variables and run correlation analyses on those means to gauge whether certain variables could be correlated with accidents. If a variable is highly correlated with the response variable, its monthly sample means should also be closely correlated with the monthly sample means of the response. Put another way, if the sample means of a variable are not highly correlated with those of the response, further exploration of the relationship is likely fruitless. The correlations of the monthly means of selected variables are as follows:
This initial plot shows the correlations of the variables’ monthly means from January to December. AccidProb is the monthly accident frequency per million trips; each variable in the matrix contributes 12 monthly-mean data points. The ratio variables are calculated as follows:
These ratios could be potentially useful predictors for modeling, as some of them show high correlation with accident frequency.
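The monthly-mean screening step above can be sketched as follows. The report's analysis is in R; this Python sketch uses a synthetic idle_ratio variable and a made-up accident-generating process purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
months = np.repeat(np.arange(1, 13), 100)  # 100 synthetic trips per month
df = pd.DataFrame({
    "month": months,
    "idle_ratio": rng.uniform(0, 1, months.size),
})
# Toy response: accident chance loosely tied to idle_ratio.
df["accident"] = (rng.uniform(0, 1, months.size)
                  < 0.05 * df["idle_ratio"]).astype(int)

# Collapse to 12 monthly means, then correlate the means -- a quick
# screen for variables worth modelling, as described above.
monthly = df.groupby("month")[["idle_ratio", "accident"]].mean()
corr = monthly.corr().loc["idle_ratio", "accident"]
print(round(corr, 2))
```

With only 12 aggregated points, this screen is coarse, which is why it is used here to rule variables out rather than to confirm effects.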
To further show that the features influence accidents, we carry out a case study using odds-ratio analysis on different variables. An odds ratio (OR) is a statistic that quantifies the strength of the association between two events, A and B. It is defined as the ratio of the odds of A in the presence of B to the odds of A in the absence of B, or equivalently (by symmetry), the ratio of the odds of B in the presence of A to the odds of B in the absence of A. Two events are independent if and only if the OR equals 1, i.e., the odds of one event are the same in either the presence or absence of the other. If the OR is greater than 1, then A and B are positively associated: compared to the absence of B, the presence of B raises the odds of A, and symmetrically the presence of A raises the odds of B. Conversely, if the OR is less than 1, then A and B are negatively correlated, and the presence of one event reduces the odds of the other. Below, the variables’ influence on accidents is shown, and the respective ORs are calculated and plotted.
Odds ratios can be used to improve model-training accuracy by limiting variable values to certain upper bounds. Since we use the whole raw data set without over-cleaning, we simply show representative case-study examples here, namely idleDriveRatio (idle_ratio in the plot) and hRpmDriveRatio (rpm_ratio in the plot). Through re-sampling we run multiple case-control comparisons of the variables and plot the representative odds ratios as follows:
When idleDriveRatio is larger than 0.5, the odds ratio is greater than 2.5, which means this variable consistently contributes to accidents; this is also supported by the correlation matrix of monthly means. When hRpmDriveRatio is larger than 0.15, the odds ratio falls toward 0, meaning higher RPM is associated with lower accident rates.
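The OR behind these plots reduces to a 2×2 contingency table. A minimal sketch, with made-up counts chosen so that the exposed group (e.g., trips with idleDriveRatio > 0.5) has 2.5 times the odds of an accident:

```python
def odds_ratio(a, b, c, d):
    """OR for a 2x2 contingency table laid out as:
                 accident   no accident
    exposed         a            b
    unexposed       c            d
    """
    return (a / b) / (c / d)

# Hypothetical counts, not the study's data: trips with
# idleDriveRatio > 0.5 ("exposed") vs. the rest.
print(odds_ratio(50, 10_000, 78, 39_000))  # -> 2.5: exposure raises the odds
```

An OR of 1 would indicate independence; values above 1 indicate the exposed group has higher odds of an accident, matching the interpretation in the text.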
The Cox proportional-hazards model (Cox, 1972) is essentially a regression model, commonly used in medical research, for investigating the association between patients’ survival time and one or more predictor variables. The Cox model is expressed through the hazard function h(t), which can briefly be interpreted as the risk of dying at time t. It can be estimated as follows:

h(t) = h0(t) × exp(b1x1 + b2x2 + … + bpxp)

where t is the survival time, h(t) is the hazard function, h0(t) is the baseline hazard, and x1, …, xp are the predictor variables with coefficients b1, …, bp.
In this study, the survival time is the period during which a driver has no accident.
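A convenient property of the Cox model is that covariates act multiplicatively on the baseline hazard, so the baseline cancels when comparing two drivers and their relative risk depends only on the covariates. A small numeric sketch (the coefficients here are placeholders, not the fitted values reported later):

```python
import math

def relative_hazard(coefs, x_a, x_b):
    """Hazard ratio h_a(t) / h_b(t) under a Cox PH model: the baseline
    hazard h0(t) cancels, leaving exp(beta . (x_a - x_b))."""
    lp = sum(b * (xa - xb) for b, xa, xb in zip(coefs, x_a, x_b))
    return math.exp(lp)

# Placeholder coefficients for two covariates; the drivers differ
# only in the second covariate (0.6 vs. 0.1).
beta = [0.4, 1.6]
hr = relative_hazard(beta, [2.0, 0.6], [2.0, 0.1])
print(round(hr, 2))  # exp(1.6 * 0.5), i.e. about a 2.2x hazard
```

Because h0(t) cancels, this ratio is constant over time, which is exactly the proportional-hazards assumption tested later with Schoenfeld residuals.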
##
## 0 1
## FALSE 2077028 67
## TRUE 930664 61
During model selection, we use the R package “mfp” to select fractional-polynomial terms of the variables. A model with a log term and an interaction term is ultimately selected. The p-values and coefficients are satisfactory: all predictors are significant at the 0.001 level, and the whole model is significant at the 3e-06 level with a concordance of 0.623.
Let’s test on the training set (over-fitting is discussed a little later) and check the prediction precision:
##
## 0 1
## FALSE 2075353 43
## TRUE 932339 85
The precision is 85/(85+43) = 0.66, with the 70th percentile of the fitted values as the decision boundary. In other words, the model predicts 85 out of 128 accidents. As the percentile threshold on the fitted value increases, the precision decreases. We will also look at the model’s F1 score.
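The arithmetic from the confusion matrix above can be reproduced directly:

```python
def detection_rate(tp, fn):
    """Share of actual accidents the model flags (85 of 128 here);
    the report refers to this quantity as precision."""
    return tp / (tp + fn)

# Counts from the training-set confusion matrix above.
rate = detection_rate(tp=85, fn=43)
print(round(rate, 2))  # -> 0.66

# The F1 score also involves the positive predictive value, which is
# tiny here (85 true positives against 932,339 false positives),
# so the F1 score ends up close to 0.
ppv = 85 / (85 + 932_339)
f1 = 2 * ppv * rate / (ppv + rate)
```

This makes concrete why the model can recover two-thirds of accidents yet still score near zero on F1: the huge false-positive count collapses the positive predictive value.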
We take the fit that has the largest testing F1 score as the final model, which is given as follows:
## Call:
## coxph(formula = Surv(I(julianstarts - 0.1), julianends, eventPrev) ~
## log(distance + 1) + I(idleDriveRatio + 0.1) + I((idleDriveRatio +
## 0.1) * log((idleDriveRatio + 0.1))), data = cvi[[1]])
##
## n= 2393416, number of events= 93
##
## coef exp(coef)
## log(distance + 1) 0.4085 1.5045
## I(idleDriveRatio + 0.1) 1.6007 4.9566
## I((idleDriveRatio + 0.1) * log((idleDriveRatio + 0.1))) -0.6258 0.5349
## se(coef) z
## log(distance + 1) 0.0978 4.177
## I(idleDriveRatio + 0.1) 0.3618 4.424
## I((idleDriveRatio + 0.1) * log((idleDriveRatio + 0.1))) 0.1695 -3.691
## Pr(>|z|)
## log(distance + 1) 2.96e-05 ***
## I(idleDriveRatio + 0.1) 9.67e-06 ***
## I((idleDriveRatio + 0.1) * log((idleDriveRatio + 0.1))) 0.000223 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef)
## log(distance + 1) 1.5045 0.6647
## I(idleDriveRatio + 0.1) 4.9566 0.2018
## I((idleDriveRatio + 0.1) * log((idleDriveRatio + 0.1))) 0.5349 1.8697
## lower .95 upper .95
## log(distance + 1) 1.2421 1.8224
## I(idleDriveRatio + 0.1) 2.4390 10.0727
## I((idleDriveRatio + 0.1) * log((idleDriveRatio + 0.1))) 0.3836 0.7457
##
## Concordance= 0.611 (se = 0.03 )
## Likelihood ratio test= 22.39 on 3 df, p=5e-05
## Wald test = 23.91 on 3 df, p=3e-05
## Score (logrank) test = 20.05 on 3 df, p=2e-04
The training set has 2,393,416 samples with 93 accidents (y=1); the out-of-sample test set has 614,404 samples with 35 accidents (y=1). The F1 score is close to 0 because there are many false-positive predictions; with such a low density of accidents in the data set, predicting accidents without many false positives is very challenging. The testing result on Company A is shown below:
##
## 0 1
## FALSE 423929 10
## TRUE 190440 25
Schoenfeld residual diagnostics. To test whether the proportional-hazards assumption holds (i.e., residuals are independent of time), the function cox.zph() correlates, for each predictor, the corresponding set of scaled Schoenfeld residuals with time. Higher p-values are better: p > 0.05 indicates no strong evidence of a relationship between the residuals and time. This model shows no evidence of such time dependence. We also saw an improvement in the global p-value using bootstrapping.
## chisq df p
## log(distance + 1) 0.658 1 0.42
## I(idleDriveRatio + 0.1) 1.041 1 0.31
## I((idleDriveRatio + 0.1) * log((idleDriveRatio + 0.1))) 0.761 1 0.38
## GLOBAL 2.536 3 0.47
Using the final fitted model, we have the following testing result on Company B’s data.
##
## 0 1
## FALSE 1080955 22
## TRUE 478349 75
For comparison, testing on Company B with the original model gives the following output:
result(fit0,b_data,70)
##
## 0 1
## FALSE 1078852 68
## TRUE 480452 29
Model comparison summary is as follows:
| Model | Concordance | Model significance (logrank) | Predictor significance | Company B testing precision | Company A testing precision |
|---|---|---|---|---|---|
| Original | 0.558 | p=0.05 | p=0.01 | 29/97 = 0.30 | 0.51 |
| New | 0.623 | p=3e-06 | p=2e-05 | 75/97 = 0.77 | 0.71 |
- More robust: model concordance and predictor significance improve dramatically, meaning the new model is statistically better.
- More generalizable: the new model's out-of-sample testing precision exceeds the original model's, meaning the new model is better in practice. The following shows the trend of accident and non-accident prediction precision for the two models.
In this part, we discuss model outputs that can be used for managerial decisions. Specifically, a five-interval risk-evaluation system is proposed to support decision making. The limitations of this study are also discussed, and the final part concludes the study.
The following histogram shows the distribution of fitted values in this study. To simplify the decision of classifying drivers’ riskiness, the fitted values are split into five groups, with boundaries marked in red in the plot. When a new driver’s data are fed into the model, the driver’s fitted value is calculated and we check which interval it falls into; the driver’s risk score is assigned accordingly. Based on Company A’s data, we have the following decision boundaries: (-∞,-0.376], (-0.376,-0.025], (-0.025,0.204], (0.204,0.428], (0.428,+∞).
From the plot above, we see that with certain thresholds we can classify trips as accident or non-accident using their fitted values. However, the two distributions are not cleanly separated; they overlap heavily in the middle. We thus obtain a model with fair precision at the price of a low F1 score, and it is quite challenging to improve this aspect given the current data quality. Another concern is that the data joining reduces the original accident count from over 1,000 to around 100: it is the mismatch of records that degrades the data quality. In addition, Company B has no accidents recorded from February to June. To further improve the model, therefore, the key is more completely matched trips and accidents data.
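The interval lookup used to score a new driver can be sketched as follows, using the decision boundaries stated above (upper bounds inclusive, with low fitted values mapping to more stars):

```python
from bisect import bisect_left

# Decision boundaries from Company A's fitted values, as given above.
BOUNDS = [-0.376, -0.025, 0.204, 0.428]
STARS = [5, 4, 3, 2, 1]  # low fitted value -> low risk -> more stars

def star_rating(fitted_value):
    """Map a driver's fitted value to a 1-5 star rating.
    bisect_left puts a value equal to a boundary into the lower
    interval, matching the (.., x] intervals in the text."""
    return STARS[bisect_left(BOUNDS, fitted_value)]

print(star_rating(-0.5), star_rating(0.0), star_rating(0.5))  # -> 5 3 1
```

The lookup is O(log n) in the number of boundaries, and adjusting the rating scheme only requires editing the two lists.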
We conclude this study with the following model, with coefficients taken from the final fit above:

h(t) = h0(t) × exp( 0.4085 × log(distance + 1) + 1.6007 × (idleDriveRatio + 0.1) − 0.6258 × (idleDriveRatio + 0.1) × log(idleDriveRatio + 0.1) )

where h0(t) is the baseline hazard, distance is the trip distance, and idleDriveRatio is the driver’s idle-to-drive ratio.
The decision boundaries are:
| Decision Intervals | Driver Risks | Driver Ratings |
|---|---|---|
| (-∞,-0.376] | very low | ⭐⭐⭐⭐⭐ |
| (-0.376,-0.025] | low to medium | ⭐⭐⭐⭐ |
| (-0.025,0.204] | medium | ⭐⭐⭐ |
| (0.204,0.428] | medium to high | ⭐⭐ |
| (0.428,+∞) | very high | ⭐ |